Abstract: We consider a task assignment problem for a fleet
of  UAVs  in  a  surveillance/search  mission.  We  formulate  the 
problem  as  a  restless  bandits  problem  with  switching  costs  and 
discounted  rewards:  there  are  N  sites  to  inspect,  each  one  of 
them  evolving  as  a  Markov  chain,  with  different  transition 
probabilities  if  the  site  is  inspected  or  not.  The  sites  evolve 
independently of each other, there are transition costs $c_{ij}$ for
moving between sites $i$ and $j \in \{1, \ldots, N\}$, rewards when visiting
the  sites,  and  we  maximize  a  mixed  objective  function  of  these 
costs  and  rewards.  This  problem  is  known  to  be  PSPACE-hard. 
We present a systematic method, inspired by the work of
Bertsimas and Niño-Mora [1] on restless bandits, for deriving
a  linear  programming  relaxation  for  such  locally  decomposable 
MDPs.  The  relaxation  is  computable  in  polynomial-time  offline, 
provides  a  bound  on  the  achievable  performance,  as  well  as 
an  approximation  of  the  cost-to-go  which  can  be  used  online 
in  conjunction  with  standard  suboptimal  stochastic  control 
methods.  In  particular,  the  one-step  lookahead  policy  based  on 
this  approximate  cost-to-go  reduces  to  computing  the  optimal 
value  of  a  linear  assignment  problem  of  size  N.  We  present 
numerical  experiments,  for  which  we  assess  the  quality  of  the 
heuristics  using  the  performance  bound. 

I.  INTRODUCTION 

In the past decade or so, teams of autonomous collaborating
unmanned aerial vehicles (UAVs) have started to
be actively used to perform various tasks of surveillance,
reconnaissance and information gathering in remote or dangerous
environments. Yet the problem of assigning tasks
to these vehicles is in general a complex stochastic control
problem, and there is still a need to develop efficient methods
aimed at optimizing the performance of these multi-agent
systems. The multi-armed bandit (MAB) problem has long
been recognized as an important model to help researchers in
this task, because of its generality coupled with an efficiently
computable solution. Gittins devotes a chapter of his book
[2] to the application of the MAB model to search theory.
Whittle introduced one of the most interesting extensions,
the restless bandits (RB) problem [3], describing a potential
application to a fleet of aircraft trying to track the positions
of enemy submarines. As long as the targets are assumed to
evolve independently (which is usually done for tractability),
the sensor management problem becomes essentially a multi-armed
bandit or restless bandits problem where one controls
the information states of the targets [4], [5].

Unlike the basic multi-armed bandit problem, however,
an optimal solution to the more expressive restless bandits
problem is unlikely to be computable efficiently, since the
problem is known to be PSPACE-hard [6]. Moreover, in
the case of moving sensors onboard UAVs, switching penalties
for changing targets are an important component to take into
account in the objective function. They add a traveling
salesman-like feature to an already difficult problem (the
multi-armed bandit problem with switching costs is NP-hard).

In this paper, we extend our previous work on polynomial-time
relaxations of the restless bandits problem with switching
costs (RBSC) [7], from the single agent to the multi-agent
case. In Section II, we formulate the RBSC problem in the
framework  of  Markov  decision  processes  (MDP).  In  Section 
III,  we  show  how  the  local  structure  of  the  objective  function 
(decomposable  by  sites  to  inspect)  and  the  assumption  on 
the  independent  evolution  of  the  states  of  the  sites  allow  us 
to  derive  systematically  a  relaxation  for  this  problem.  The 
performance  measures  used  in  a  similar  way  in  [1]  for  the 
restless  bandits  problem  are  always  marginals  of  the  state- 
action  frequencies  appearing  in  the  exact  formulation.  The 
relaxation provides an efficiently computable bound on the
achievable performance. Section IV describes how this relaxation
can be used to create heuristics to solve the problem
in  practice,  and  presents  numerical  experiments  comparing 
the  heuristics  to  the  performance  bound.  Section  V  presents 
a  preliminary  analysis  toward  an  a  priori  performance  bound 
for  our  heuristics. 

II.  MULTI-AGENT  RBSC:  EXACT  FORMULATION 
A.  The  State-Action  Frequency  Approach  to  MDPs 

In this section, we first review the linear programming
approach based on occupation measures to formulate Markov
decision processes (MDP). The RBSC problem is then formulated
in this framework. A (discrete-time) MDP is defined
by a tuple $\{X, A, \mathcal{P}, r\}$ as follows:

• $X$ is the finite state space.

• $A$ is the finite set of actions. $A(x) \subset A$ is the subset
of actions available at state $x$. $\mathcal{K} = \{(x,a) : x \in X, a \in
A(x)\}$ is the set of state-action pairs.

• $\mathcal{P}$ are the transition probabilities. $\mathcal{P}_{xay}$ is the probability
of moving from state $x$ to state $y$ if action $a$ is
chosen.

• $r : \mathcal{K} \to \mathbb{R}$ is an immediate reward.

We define the history at time $t$ to be the sequence of
previous states and actions, as well as the current state:
$h_t = (x_1, a_1, x_2, a_2, \ldots, x_{t-1}, a_{t-1}, x_t)$. Let $H_t$ be the set of all
possible histories of length $t$. A policy $u$ in the class of
all policies $U$ is a sequence $(u_1, u_2, \ldots)$. If the history $h_t$
is observed at time $t$, then the controller chooses an action
$a$ with probability $u_t(a|h_t)$. A policy is called a Markov
policy ($u \in U_M$) if for any $t$, $u_t$ only depends on the state
at time $t$. A stationary policy ($u \in U_S$) is a Markov policy
that does not depend on $t$. Under a stationary policy, the state
process becomes a Markov chain with transition probabilities
$P_{xy}(u) = \sum_{a \in A(x)} \mathcal{P}_{xay} u(a|x)$. Finally, a stationary policy is
a deterministic policy ($u \in U_D$) if it selects an action with
probability one. Then $u$ is identified with a map $u : X \to A$.

We fix an initial distribution $\nu$ over the initial states. In
other words, the probability that we are at state $x$ at time 1
is $\nu(x)$. If $\nu$ is concentrated on a single state $z$, we use
the Dirac notation $\nu(x) = \delta_z(x)$. Kolmogorov's extension
theorem guarantees that the initial distribution $\nu$ and any
given policy $u$ determine a unique probability measure $P^u_\nu$
over the space of trajectories of the states $X_t$ and actions $A_t$.
We denote by $E^u_\nu$ the corresponding expectation operator.

For any policy $u$ and initial distribution $\nu$, and for a
discount factor $0 < \alpha < 1$, we define

\[ R_\alpha(\nu,u) = (1-\alpha)\, E^u_\nu \sum_{t=1}^{\infty} \alpha^{t-1} r(X_t,A_t) = (1-\alpha) \sum_{t=1}^{\infty} \alpha^{t-1} E^u_\nu\, r(X_t,A_t) \]

(the exchange of limit and expectation is valid in the case
of finitely many states and actions using the dominated
convergence theorem).

An occupation measure corresponding to a policy $u$ is the
total expected discounted time spent in different state-action
pairs. More precisely, we define for any initial distribution
$\nu$, any policy $u$ and any pair $x \in X, a \in A(x)$:

\[ f_\alpha(\nu,u;x,a) := (1-\alpha) \sum_{t=1}^{\infty} \alpha^{t-1} P^u_\nu(X_t = x, A_t = a). \]

The set $\{f_\alpha(\nu,u;x,a)\}_{x,a}$ defines a probability measure
$f_\alpha(\nu,u)$ on the space of state-action pairs that assigns
probability $f_\alpha(\nu,u;x,a)$ to the pair $(x,a)$. $f_\alpha(\nu,u)$ is called
an occupation measure and is associated to a stationary
policy $w$ defined by:

\[ w(a|y) = \frac{f_\alpha(\nu,u;y,a)}{\sum_{a' \in A(y)} f_\alpha(\nu,u;y,a')}, \quad \forall y \in X, a \in A(y), \tag{1} \]

whenever the denominator is non-zero, otherwise we can
choose $w(a|y)$ arbitrarily. We can readily check that

\[ R_\alpha(\nu,u) = \sum_{x \in X} \sum_{a \in A(x)} f_\alpha(\nu,u;x,a)\, r(x,a). \tag{2} \]


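We define for any class of policies $U_1$

\[ L^\alpha_{U_1}(\nu) = \bigcup_{u \in U_1} f_\alpha(\nu,u), \]

i.e., the set of vectors described by the class of policies
considered. Also, let $Q^\alpha(\nu)$ be the set of vectors $\rho \in \mathbb{R}^{|\mathcal{K}|}$
satisfying

\[ \begin{cases} \sum_{y \in X} \sum_{a \in A(y)} \rho_{y,a} \left( \delta_x(y) - \alpha \mathcal{P}_{yax} \right) = (1-\alpha)\nu(x), & \forall x \in X \\ \rho_{y,a} \ge 0, & \forall y \in X, a \in A(y). \end{cases} \tag{3} \]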

$Q^\alpha(\nu)$ is a closed polyhedron. Note that by summing the
first constraints over $x$ we obtain $\sum_{y,a} \rho_{y,a} = 1$, so $\rho$ satisfying
the above constraints defines a probability measure. It also
follows that $Q^\alpha(\nu)$ is bounded, i.e., is a closed polytope.
One can check that the occupation measures $f_\alpha(\nu,u)$ belong
to this polytope, i.e., $L^\alpha_U(\nu) \subset Q^\alpha(\nu)$. The following theorem
states that $Q^\alpha(\nu)$ describes in fact exactly the set of
occupation measures achievable by all policies, and that each
policy can be obtained as a randomization over deterministic
policies, which represent the extreme points of the polytope
$Q^\alpha(\nu)$. See [8] for a proof under a more general form.

Theorem 1: $L^\alpha_U(\nu) = L^\alpha_{U_S}(\nu) = \overline{\mathrm{co}}\, L^\alpha_{U_D}(\nu) = Q^\alpha(\nu)$.

Since  deterministic  policies  represent  the  extreme  points  of 
the  polytope  of  occupation  measures,  we  know  from  standard 
LP  theory  that  it  is  sufficient  to  look  among  the  deterministic 
policies  for  a  policy  maximizing  Ra(v,u).  Moreover,  one  can 
obtain  an  optimal  occupation  measure  corresponding  to  the 
maximization  of  (2)  as  the  solution  of  a  linear  program  over 
the polytope $Q^\alpha(\nu)$.
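The constraints (3) can be checked numerically on a small example. The following is a minimal sketch in plain Python, assuming a hypothetical two-state, two-action MDP and a randomized stationary policy whose numbers are purely illustrative (none of them come from the paper). It truncates the series defining the occupation measure and verifies that the result is a probability measure satisfying the equality constraints in (3).

```python
def occupation_measure(P, policy, nu, alpha, horizon=2000):
    """Truncated occupation measure f[x][a] of a stationary policy.

    P[x][a][y] is the transition probability, policy[x][a] the probability
    of choosing action a in state x, nu the initial distribution.
    """
    n, m = len(P), len(P[0])
    f = [[0.0] * m for _ in range(n)]
    dist = list(nu)            # distribution of X_t, starting at t = 1
    weight = 1.0 - alpha       # (1 - alpha) * alpha**(t - 1)
    for _ in range(horizon):
        for x in range(n):
            for a in range(m):
                f[x][a] += weight * dist[x] * policy[x][a]
        dist = [
            sum(dist[x] * policy[x][a] * P[x][a][y]
                for x in range(n) for a in range(m))
            for y in range(n)
        ]
        weight *= alpha
    return f


def balance_residual(P, f, nu, alpha):
    """Largest violation of the equality constraints in (3)."""
    n, m = len(P), len(P[0])
    worst = 0.0
    for x in range(n):
        lhs = sum(f[y][a] * ((1.0 if y == x else 0.0) - alpha * P[y][a][x])
                  for y in range(n) for a in range(m))
        worst = max(worst, abs(lhs - (1.0 - alpha) * nu[x]))
    return worst


# Hypothetical two-state, two-action MDP (illustrative numbers only).
P = [[[0.9, 0.1], [0.2, 0.8]],
     [[0.5, 0.5], [0.1, 0.9]]]
policy = [[0.7, 0.3], [0.4, 0.6]]
nu = [1.0, 0.0]
alpha = 0.9

f = occupation_measure(P, policy, nu, alpha)
total_mass = sum(sum(row) for row in f)
residual = balance_residual(P, f, nu, alpha)
```

With a truncation horizon of 2000 steps and $\alpha = 0.9$, the neglected tail is far below floating-point precision, so `total_mass` should be numerically 1 and `residual` numerically 0.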

B.  Exact  Formulation  of  the  RBSC  Problem 

In  the  RBSC  problem,  N  projects  are  distributed  in  space 
at  N  sites,  and  M  <N  servers  can  be  allocated  to  M  different 
projects  at  each  time  period  t  =  1,2,...  In  the  following, 
we  use  the  terms  project  and  site  interchangeably;  likewise, 
agent  and  server  have  the  same  meaning  (they  are  the  UAVs 
in  our  application).  At  each  time  period,  each  server  must 
occupy  one  site,  and  different  servers  must  occupy  distinct 
sites.  We  say  that  a  site  is  active  at  time  t  if  it  is  visited  by  a 
server, and is passive otherwise. If a server travels from site $k$
to site $l$, we incur a cost $c_{kl}$. Each site can be in one of a finite
number of states $x_n \in S_n$, for $n = 1, \ldots, N$, and we denote
the Cartesian product of the individual state spaces $\mathcal{S} =
S_1 \times \ldots \times S_N$. If site $n$ in state $x_n$ is visited, a reward $r^1_n(x_n)$ is
earned, and its state changes to $y_n$ according to the transition
probability $p^1_{x_n y_n}$. If the site is not visited, then a reward
(potentially negative) $r^0_n(x_n)$ is earned for that site and its
state changes according to the transition probabilities $p^0_{x_n y_n}$.
We  assume  that  all  sites  change  their  states  independently  of 
each  other. 

Note  that  if  the  transition  costs  are  all  0,  we  recover  the 
initial formulation of the RB problem [3]. If in addition the
passive  rewards  are  0  and  the  passive  transition  matrix  is  the 
identity  matrix,  we  obtain  the  MAB  problem.  If  we  just  add 
the  switching  costs  to  the  basic  MAB  problem,  we  call  the 
resulting  model  MABSC. 

We denote the set $\{1, \ldots, N\}$ by $[N]$. We consider that
when no agent is present at a given site, there is a fictitious
agent called passive agent at that site. We also call the real
agents active agents, since they collect active rewards. The
transition of a passive agent between sites does not involve
any switching cost, and when a passive agent is present at a
site, the passive reward is earned. Therefore, we have a total
of $N$ agents including both the real and passive agents, and
we can describe the positions of all agents by a vector $s =
(s_1, \ldots, s_N)$, which corresponds to a permutation of $[N]$ (due
to our constraint that different agents must occupy different
sites). We denote the set of these permutation vectors by $\Pi_{[N]}$.
The $M$ first components correspond to the real agents. For
example, with $M = 2$ and $N = 4$, the vector $(s_1 = 2, s_2 =
3, s_3 = 1, s_4 = 4) \in \Pi_{[4]}$ means that agent 1 is in site 2, agent
2 in site 3 and sites 1 and 4 are passive.
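The position vector of this example can be written down directly. The sketch below is only an illustration of the encoding, with a hypothetical helper `split_sites` that is not part of the paper: positions are tuples, the first $M$ entries belong to the real agents, and the set of all position vectors is enumerated with `itertools.permutations`.

```python
from itertools import permutations


def split_sites(s, M):
    """Return (active, passive) site lists for a position vector s.

    s[i] is the site occupied by agent i + 1; the first M agents are real,
    the remaining ones are the fictitious passive agents.
    """
    return sorted(s[:M]), sorted(s[M:])


N, M = 4, 2
all_positions = list(permutations(range(1, N + 1)))  # all N! position vectors
s = (2, 3, 1, 4)                                     # the example above
active, passive = split_sites(s, M)
```

Here `active` lists the sites visited by the real agents and `passive` the sites held by passive agents. The number of position vectors grows as $N!$, which is why this encoding is used for deriving the relaxation rather than for direct computation.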

For an agent $i \in [N]$, we refer to the other agents by $-i$.
If we fix $s_i \in [N]$ for some $1 \le i \le N$, then we write $s_{-i}$
to denote the vector $(s_1, \ldots, s_{i-1}, s_{i+1}, \ldots, s_N)$ and $\Pi_{[N]-s_i}$ to
denote the permutations of the set $[N] - \{s_i\}$. In particular,
we write $\sum_{s_{-i} \in \Pi_{[N]-s_i}}$ to denote the sum over all permutations
of the coordinates of the agents $-i$, over the set of sites
not occupied by agent $i$. We also write $\mathcal{S}_{-i}$ to denote the
Cartesian product $S_1 \times \ldots \times S_{i-1} \times S_{i+1} \times \ldots \times S_N$.

The state of the system at time $t$ can be described by
the state of each site and the position $s \in \Pi_{[N]}$ of the
servers, including the passive ones (even if more compact
state descriptions are possible, our choice is motivated by the
formulation of the relaxation in the next section). We denote
the complete state by $(x_1, \ldots, x_N; s_1, \ldots, s_N) := (x;s)$. We can
choose which sites are to be visited next, i.e., the action $a$
belongs to the set $\Pi_{[N]}$ and corresponds to the assignment
of the agents, including the passive ones, to the sites for the
time period. Once the sites to be visited are chosen, there are
costs $c_{s_i a_i}$ for moving the active agent $i$ from site $s_i$ to site
$a_i$, including possibly a nonzero cost for staying at the same
site. The reward earned is $\sum_{i=1}^{M} r^1_{a_i}(x_{a_i}) + \sum_{i=M+1}^{N} r^0_{a_i}(x_{a_i})$. We
are given a distribution $\nu$ on the initial state of the system,
and we will assume a product form $\nu(x_1, \ldots, x_N; s_1, \ldots, s_N) =
\nu_1(x_1) \ldots \nu_N(x_N)\, \delta_{d_1}(s_1) \ldots \delta_{d_N}(s_N)$, i.e., the initial states of
the sites are independent and server $i$ leaves initially from
site $d_i$, with $d \in \Pi_{[N]}$.

The transition matrix has a particular structure, since the
sites evolve independently and the transitions of the agents
are deterministic:

\[ \mathcal{P}_{(x;s)a(x';s')} = \prod_{i=1}^{M} p^1_{x_{a_i} x'_{a_i}} \prod_{i=M+1}^{N} p^0_{x_{a_i} x'_{a_i}} \prod_{i=1}^{N} \delta_{a_i}(s'_i). \]

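With these elements, we can formulate the RBSC problem
with multiple agents as follows:

\[ \text{maximize} \quad \sum_{(x;s) \in \mathcal{S} \times \Pi_{[N]}} \sum_{a \in \Pi_{[N]}} \left( \sum_{i=1}^{M} \left( r^1_{a_i}(x_{a_i}) - c_{s_i a_i} \right) + \sum_{i=M+1}^{N} r^0_{a_i}(x_{a_i}) \right) \rho_{(x;s),a} \tag{4} \]

subject to

\[ \sum_{a \in \Pi_{[N]}} \rho_{(x;s),a} - \alpha \sum_{s' \in \Pi_{[N]}} \sum_{x' \in \mathcal{S}} \rho_{(x';s'),s} \prod_{i=1}^{M} p^1_{x'_{s_i} x_{s_i}} \prod_{i=M+1}^{N} p^0_{x'_{s_i} x_{s_i}} = (1-\alpha) \prod_{i=1}^{N} \nu_i(x_i)\, \delta_{d_i}(s_i), \quad \forall (x,s) \in \mathcal{S} \times \Pi_{[N]}, \]

\[ \rho_{(x;s),a} \ge 0, \quad \forall ((x;s),a) \in \mathcal{S} \times \Pi_{[N]}^2, \]

with the decision variables $\rho_{(x;s),a}$ corresponding to an occupation
measure. Note that the formulation above is of little
computational interest since the number of variables and
constraints is of the order of $|\mathcal{S}| \times (N!)^2$, that is, exponential
in the size of the input.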

III.  LINEAR  PROGRAMMING  RELAXATION 

The complexity result known for the restless bandits problem
justifies the search for efficient methods that approximate
the optimal solution of the RBSC problem. An interesting
feature of the multi-armed bandit framework is that it leads
naturally to a Markov decision process for each site. This was
already noticed by Whittle [3] (see also [9]), whose solution
was based on relaxing the hard constraints tying the projects
together, enforcing them only on average.

In  this  section  we  will  see  that  in  the  state-action  frequency 
domain,  a  relaxation  can  be  easily  obtained  by  considering 
specific  marginals  of  the  occupation  measure.  Once  the 
relaxation  is  obtained,  we  can  go  back  to  the  value  function 
domain  of  dynamic  programming  by  taking  the  dual  of 
the  relaxed  linear  program.  Then  it  becomes  clear  that  the 
essential  features  of  the  problem  that  allow  the  method  to 
work  are  the  separable  structure  of  the  objective  function  and 
the  independence  assumption  of  the  evolution  of  the  sites. 
When  the  coupling  between  the  sites  increases  (for  instance, 
when  we  introduce  switching  costs  to  the  RB  problem),  the 
relaxation  becomes  more  complicated  and  grows  in  size. 
In  general,  the  method  used  to  derive  a  relaxation  is  the 
following: 

(i) identify the marginals of the state-action frequencies
that are sufficient to express the objective function in (4).

(ii)  express  the  constraints  on  these  marginals  by  partially 
summing  the  constraints  in  (4). 

(iii)  add  the  constraints  due  to  the  fact  that  these  marginals 
all  derive  from  the  same  state-action  frequency  vector. 

The  link  between  this  method  and  the  work  of  Bertsimas 
and  Nino-Mora  on  restless  bandits  [1]  was  also  highlighted 
in  our  previous  paper  [7].  As  we  illustrate  now,  this  method 
is  relatively  systematic  once  the  original  linear  program  has 
been  formulated,  and  in  principle  can  be  extended  to  derive 
relaxations  for  other  problems  with  a  certain  decomposable 
structure. 

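We start by rewriting the objective function and we
identify the relevant marginals:

\[ \sum_{i=1}^{M} \sum_{a_i=1}^{N} \sum_{x_{a_i} \in S_{a_i}} r^1_{a_i}(x_{a_i})\, \rho^i_{x_{a_i};a_i} + \sum_{i=M+1}^{N} \sum_{a_i=1}^{N} \sum_{x_{a_i} \in S_{a_i}} r^0_{a_i}(x_{a_i})\, \rho^i_{x_{a_i};a_i} - \sum_{i=1}^{M} \sum_{a_i=1}^{N} \sum_{s_i=1}^{N} c_{s_i a_i}\, \rho^i_{s_i;a_i}, \]

where the marginals appearing above are obtained as follows:

\[ \rho^i_{x_{a_i};a_i} = \sum_{a_{-i} \in \Pi_{[N]-a_i}} \sum_{s \in \Pi_{[N]}} \sum_{x_{-a_i} \in \mathcal{S}_{-a_i}} \rho_{(x;s),a}, \]

\[ \rho^i_{s_i;a_i} = \sum_{x \in \mathcal{S}} \sum_{a_{-i} \in \Pi_{[N]-a_i}} \sum_{s_{-i} \in \Pi_{[N]-s_i}} \rho_{(x;s),a}, \]

and the superscripts refer to the agents.

Now to express the constraints for these marginals, it turns
out that the following variables are sufficient:

\[ \rho^i_{(x_j;s_i),a_i} = \sum_{x_{-j} \in \mathcal{S}_{-j}} \sum_{s_{-i} \in \Pi_{[N]-s_i}} \sum_{a_{-i} \in \Pi_{[N]-a_i}} \rho_{(x;s),a}, \quad \forall x_j \in S_j, \ \forall (i,j,s_i,a_i) \in [N]^4. \tag{5} \]

Clearly we have

\[ \rho^i_{s_i;a_i} = \sum_{x_j \in S_j} \rho^i_{(x_j;s_i),a_i} = \sum_{x_1 \in S_1} \rho^i_{(x_1;s_i),a_i}, \quad \forall (i,j,s_i,a_i) \in [N]^4, \tag{6} \]

\[ \rho^i_{x_{a_i};a_i} = \sum_{s_i=1}^{N} \rho^i_{(x_{a_i};s_i),a_i}, \quad \forall (i,a_i,x_{a_i}). \tag{7} \]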

Note  that  we  could  formulate  everything  in  terms  of  the 
variables  (5),  but  we  chose  to  introduce  some  additional 
variables  for  readability.  However,  the  second  equality  in 
(6)  is  important  to  obtain  a  sufficiently  strong  relaxation; 
it  expresses  a  compatibility  condition  that  is  clearly  true 
because  the  marginals  are  obtained  from  the  same  original 
distribution. 
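The marginals (5) and the compatibility condition (6) can be illustrated on a toy joint distribution. The sketch below is a plain Python illustration under stated assumptions: the instance size, the random joint distribution `rho` and the helper names are hypothetical, and indices are 0-based. It checks that the position-action marginal of an agent is the same whichever site is summed out, and that an agent is away from a site exactly when some other agent occupies it, which holds because positions form a permutation.

```python
import random
from itertools import permutations, product

N = 3
STATES = [2, 2, 2]                       # |S_j| for each site j (0-based)
PERMS = list(permutations(range(N)))     # position and action vectors
XS = list(product(*(range(k) for k in STATES)))

# Hypothetical joint distribution over ((x; s), a), normalized to sum to 1.
rng = random.Random(0)
weights = {(x, s, a): rng.random() for x in XS for s in PERMS for a in PERMS}
total = sum(weights.values())
rho = {key: w / total for key, w in weights.items()}


def marginal(i, j, xj, si, ai):
    """rho^i_{(x_j; s_i), a_i}: sum the joint over x_{-j}, s_{-i}, a_{-i}."""
    return sum(p for (x, s, a), p in rho.items()
               if x[j] == xj and s[i] == si and a[i] == ai)


def position_action(i, si, ai, j):
    """rho^i_{s_i; a_i} computed through site j, as in (6)."""
    return sum(marginal(i, j, xj, si, ai) for xj in range(STATES[j]))


# (6): the result must not depend on which site j is summed out.
via_sites = [position_action(0, 1, 2, j) for j in range(N)]

# Agent 0 is away from site 2 exactly when some other agent occupies it.
site, state = 2, 1
away = sum(marginal(0, site, state, si, ai)
           for si in range(N) if si != site for ai in range(N))
others = sum(marginal(k, site, state, site, ak)
             for k in range(1, N) for ak in range(N))
```

The three entries of `via_sites` should agree up to rounding, and `away` should equal `others`. Any vector of marginals that fails such checks cannot come from a single state-action frequency vector.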

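For agent $i$, some $x \in \mathcal{S}$ and $s \in \Pi_{[N]}$, we can sum the
constraints in (4) over $x_{s_j} \in S_{s_j}$ for $j \neq i$. The transition
probabilities of the sites $s_j$, $j \neq i$, then sum to one, and only the
factor $p^{1\{i \le M\}}_{x'_{s_i} x_{s_i}}$ remains, where $1\{i \le M\}$ is the indicator
variable of the set $\{i \le M\}$. Then we can sum over $s_j$, $j \neq i$, for a
fixed $s_i$, and rewrite the sums over $a \in \Pi_{[N]}$ and $s' \in \Pi_{[N]}$ as
nested sums over the coordinate of agent $i$ and the coordinates of
the agents $-i$, to obtain

\[ \sum_{a_i=1}^{N} \rho^i_{(x_{s_i};s_i),a_i} - \alpha \sum_{s'_i=1}^{N} \sum_{x'_{s_i} \in S_{s_i}} \rho^i_{(x'_{s_i};s'_i),s_i}\, p^{1\{i \le M\}}_{x'_{s_i} x_{s_i}} = (1-\alpha)\, \nu_{s_i}(x_{s_i})\, \delta_{d_i}(s_i), \quad \forall i, s_i \in [N], \ x_{s_i} \in S_{s_i}. \tag{8} \]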
Last, there are still some compatibility conditions between
the marginals that have not been expressed. To obtain a
sufficiently strong relaxation, we want to take into account
the fact that no two agents can be at the same time in the
same site. This is expressed in terms of marginals as follows:

\[ \sum_{a_i \in [N]-a} \sum_{s_i \in [N]} \rho^i_{(x_a;s_i),a_i} = \sum_{k \in [N]-i} \sum_{s_k \in [N]} \rho^k_{(x_a;s_k),a}, \quad \forall i, a, x_a \in S_a, \tag{9} \]

\[ \sum_{s_i \in [N]-s} \sum_{a_i \in [N]} \rho^i_{(x_s;s_i),a_i} = \sum_{k \in [N]-i} \sum_{a_k \in [N]} \rho^k_{(x_s;s),a_k}, \quad \forall i, s, x_s \in S_s. \tag{10} \]

Intuitively, on the left hand side we have the probability that
agent $i$ does not go to a site $a$ (respectively does not currently
occupy a site $s$), which must equal the probability that some
other agent $k$ (passive or not) goes to site $a$ (respectively
occupies site $s$). These relations can be verified by inspection
of (5). Finally, the relaxation (11) consists of maximizing the
objective function rewritten above in terms of the marginals,
subject to the constraints (6)-(10) and to the nonnegativity of
the variables (5).

There are $O(N^4 \times \max_i |S_i|)$ variables $\rho^i_{(x_j;s_i),a_i}$, which is
polynomial in the size of the input. This number is independent
of the number of real agents in the instance of our
problem. Note that in [7], we obtain in the single agent
case a relaxation with $O(N^3 \times \max_i |S_i|)$ variables, which
is therefore preferable to (11) in the case $M = 1$. From a
careful comparison of (11) with the relaxation obtained in
[7] for the single agent case, we verified that when $M = 1$ the
two formulations are equivalent, even if (11) involves more
variables and constraints. However, (11) has the advantage
of being valid for any number of agents, and this number
can be given as a parameter to our problem.

IV. HEURISTICS
A. One-Step Lookahead Policy

Computing the optimum value of the relaxation presented
in the previous section provides a bound on the performance
achievable by any assignment policy. It is also useful to actually
design a policy for the system, via standard suboptimal
control techniques.

By taking the dual of the linear program obtained from
the state-action frequency formulation, we obtain a linear
program whose variables, indexed by the different states of
the system, have the interpretation of the value function for
these states [8]. Indeed, this dual program can be obtained
directly from Bellman's equation. Now we can take the dual
of our relaxed formulation (11); in this paper, we do not
consider the structure of this dual program in detail, but we
note that the dual objective function involves a subset of the
dual variables corresponding to the constraints (8) and can
be written

\[ \text{minimize} \quad (1-\alpha) \sum_{i=1}^{N} \sum_{s_i=1}^{N} \sum_{x_{s_i} \in S_{s_i}} \nu_{s_i}(x_{s_i})\, \delta_{d_i}(s_i)\, \lambda^i_{x_{s_i},s_i}, \]

where the dual variables of interest are the $\lambda^i_{x_{s_i},s_i}$. Now
consider the system at a given time, for which the state
is $(x_1, \ldots, x_N; s_1, \ldots, s_N)$, with $(s_1, \ldots, s_N)$ a permutation of
$[N]$. Given the interpretation of the dual variables mentioned
above, it is natural to try to form an approximation $\tilde{J}(x;s)$
of the value function in state $(x;s)$ as

\[ \tilde{J}(x_1, \ldots, x_N; s_1, \ldots, s_N) = \sum_{i=1}^{N} \bar{\lambda}^i_{x_{s_i},s_i} \tag{12} \]

where $\bar{\lambda}^i_{x_{s_i},s_i}$ are the optimal values of the dual variables
obtained from the LP relaxation. The separable form of this
approximate cost function is useful to design an efficiently
computable one-step lookahead policy, as follows. At state
$(x;s)$, we obtain the assignment of the agents by solving
the following maximization problem:

\[ \tilde{u}(x;s) \in \arg\max_{a \in \Pi_{[N]}} \Big\{ g((x;s),a) + \alpha \sum_{x' \in \mathcal{S}} \mathcal{P}_{(x;s)a(x';a)}\, \tilde{J}(x';a) \Big\} \]

\[ = \arg\max_{a \in \Pi_{[N]}} \Big\{ \sum_{i=1}^{M} \Big( r^1_{a_i}(x_{a_i}) - c_{s_i a_i} + \alpha \sum_{x' \in S_{a_i}} p^1_{x_{a_i} x'}\, \bar{\lambda}^i_{x',a_i} \Big) + \sum_{i=M+1}^{N} \Big( r^0_{a_i}(x_{a_i}) + \alpha \sum_{x' \in S_{a_i}} p^0_{x_{a_i} x'}\, \bar{\lambda}^i_{x',a_i} \Big) \Big\}. \tag{13} \]

Here $g((x;s),a)$ is the one-step reward appearing in the objective of (4),
and the second expression follows from the separable form (12) and
the independent evolution of the sites.
Assuming that the optimal dual variables have been stored in
memory, the maximization above is actually easy to perform.
The evaluation of one parenthesis involves only the data of
the problem for the current state of the system, and one
summation over the states of one site, i.e., takes a time
$O(\max_i |S_i|)$. Let us denote the terms in parentheses $m_{i,a_i}$.
All these terms can be computed in time $O(N^2 \max_i |S_i|)$,
and (13) can be rewritten:

\[ \max_{a \in \Pi_{[N]}} \sum_{i=1}^{N} m_{i,a_i}. \]

This is a linear assignment problem, which can be solved
by linear programming or in time $O(N^3)$ by the Hungarian
method [10]. Thus, the assignment is computed at each time
step in time $O(N^2 \max_i |S_i| + N^3)$ by a centralized controller,
which needs to store the optimal dual variables of the
relaxation in addition to the parameters of the problem.
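The online step described above can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the authors' implementation: the array layout (`r1[a][state]`, `c[k][l]`, `p1[a][state][next]`, `lam[i][a][next]` for the stored dual values), the 0-based indices and all numbers are hypothetical, and the assignment problem is solved by brute force over permutations as a stand-in for the Hungarian method, which is only reasonable for very small $N$.

```python
from itertools import permutations


def lookahead_scores(x, s, M, r1, r0, c, p1, p0, lam, alpha):
    """Score m[i][a] of sending agent i (currently at site s[i]) to site a.

    x[a] is the current state of site a; agents 0..M-1 are the real ones.
    """
    N = len(s)
    m = [[0.0] * N for _ in range(N)]
    for i in range(N):
        for a in range(N):
            if i < M:   # real agent: active reward minus switching cost
                reward, p = r1[a][x[a]] - c[s[i]][a], p1
            else:       # passive agent: passive reward, no switching cost
                reward, p = r0[a][x[a]], p0
            row = p[a][x[a]]
            future = sum(row[y] * lam[i][a][y] for y in range(len(row)))
            m[i][a] = reward + alpha * future
    return m


def best_assignment(m):
    """Brute-force linear assignment (stand-in for the Hungarian method)."""
    N = len(m)
    return max(permutations(range(N)),
               key=lambda a: sum(m[i][a[i]] for i in range(N)))


# Hypothetical instance: 3 sites with 2 states each, 1 real agent.
N, M, alpha = 3, 1, 0.9
x, s = (0, 0, 0), (0, 1, 2)
r1 = [[1.0, 1.0], [5.0, 5.0], [2.0, 2.0]]
r0 = [[0.0, 0.0] for _ in range(N)]
c = [[0.0, 1.0, 3.0], [1.0, 0.0, 1.0], [3.0, 1.0, 0.0]]
p1 = [[[0.5, 0.5], [0.5, 0.5]] for _ in range(N)]
p0 = [[[0.5, 0.5], [0.5, 0.5]] for _ in range(N)]
lam = [[[0.0, 0.0] for _ in range(N)] for _ in range(N)]
lam[0][2] = [10.0, 0.0]   # illustrative stored dual values for agent 0, site 2

m = lookahead_scores(x, s, M, r1, r0, c, p1, p0, lam, alpha)
assignment = best_assignment(m)
```

In this toy instance the real agent moves to site 1: the immediate gain there outweighs the discounted future value attached to site 2. Replacing `best_assignment` by a polynomial-time assignment solver recovers the complexity stated in the text.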


B.  Computational  Considerations 

The  major  bottleneck  limiting  the  use  of  our  method  for 
large-scale  problems  is  the  computation  of  the  relaxation 
(11)  which  involves  a  large  number  of  variables.  Hence, 
to  solve  a  problem  with  10  sites  which  can  each  be  in 
one  of  five  states  already  requires  solving  a  linear  program 
with  50000  variables  (independently  of  the  number  of  agents 
used).  This  relaxation  is  computed  off-line,  in  order  to  obtain 
a  bound  on  the  achievable  performance  and  the  optimal 
dual variables. However, the limits of the state-of-the-art
linear  programming  technology  are  reached  for  relatively 
small  problems.  For  these  problems,  the  one-step  lookahead 
policy  described  in  the  previous  paragraph  is  easily  computed 
in  real-time.  If  more  time  is  available,  then  this  one-step 
lookahead  policy  can  sometimes  be  used  as  a  base  policy  to 
obtain  a  rollout  policy  [11],  whose  performance  is  known  to 
be  at  least  as  good  as  the  base  policy,  and  in  practice  offers 
interesting  improvements. 
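The sizes quoted above are easy to reproduce. The short sketch below, with hypothetical function names, counts the variables $\rho^i_{(x_j;s_i),a_i}$ of (5) for given numbers of states per site and compares this with the order of magnitude $|\mathcal{S}| \times (N!)^2$ of the exact formulation (4).

```python
from math import factorial


def exact_lp_variables(state_counts):
    """Order of the number of variables of the exact LP: |S| * (N!)^2."""
    N = len(state_counts)
    joint_states = 1
    for k in state_counts:
        joint_states *= k
    return joint_states * factorial(N) ** 2


def relaxation_variables(state_counts):
    """Number of variables rho^i_{(x_j; s_i), a_i}, one per (i, s_i, a_i, j, x_j)."""
    N = len(state_counts)
    return N ** 3 * sum(state_counts)


counts = [5] * 10          # 10 sites, 5 states each
n_relaxed = relaxation_variables(counts)
n_exact = exact_lp_variables(counts)
```

For 10 sites with five states each, `n_relaxed` is 50000, in agreement with the figure given in the text, while `n_exact` exceeds $10^{20}$.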

It  is  also  interesting  to  consider  decentralized  algorithms. 
Assuming that each agent stores only its own optimal dual
variables and the parameters of the problem, we expect that
existing  work  on  the  distributed  computation  of  the  value  of 
the  assignment  problem  [12]  can  be  used.  We  will  explore 
this  direction  in  future  research. 


C.  Numerical  Experiments 

We  now  briefly  present  some  numerical  experiments  with 
the  proposed  policy.  We  can  compare  the  lower  bound  on  the 
optimal  reward  obtained  through  the  use  of  a  specific  policy 
to  the  upper  bound  obtained  from  the  relaxation.  We  also 
compare the one-step lookahead policy to a simple greedy
policy, where the approximate cost-to-go $\tilde{J}$ is taken to be
0. It is known that this greedy policy is optimal for the
MAB  problem  (single  agent  case),  when  the  rewards  are 
deteriorating,  i.e.,  projects  become  less  profitable  as  they  are 
worked  on. 

Linear  programs  are  implemented  in  AMPL  and  solved 
using  CPLEX.  Due  to  the  size  of  the  state  space,  the  expected 
discounted  reward  of  the  various  heuristics  is  computed  via 
Monte-Carlo  simulations.  The  computation  of  each  trajectory 
is  terminated  after  a  sufficiently  large,  but  finite  horizon:  in 
our case, when $\alpha^t$ times the maximal absolute value of any
immediate reward becomes less than $10^{-6}$. Table I presents
results  for  various  RBSC  problems.  The  number  of  states  per 
site  in  the  scenarios  varies  between  3  and  5.  We  adopt  the 
following  nomenclature: 

• $Z_r$: optimal value of the relaxation.

• $Z_{osl}$: estimated expected value of the one-step lookahead
policy.

• $Z_g$: estimated expected value of the greedy policy.

Problem  1  is  designed  to  make  the  greedy  policy  under¬ 
perform:  one  of  the  sites  has  a  higher  initial  reward  but  it 
is  also  very  costly  to  move  from  this  site  to  the  others  (the 
distances  are  asymmetric  here).  Note  that  in  this  problem 
the  one-step  lookahead  policy  performed  remarkably  well, 
and  in  general  it  is  not  as  clear  how  to  make  it  strongly 
underperform. Problems 2 and 3 have the same number of
sites and agents, but in problem 2 the switching costs are
small compared to the rewards, whereas in problem 3 the
costs and rewards are of the same order of magnitude. These
problems confirm that high transition costs have a dramatic
effect on the performance of the greedy policy. If they
are  negligible  however,  it  is  not  clear  that  it  is  beneficial  to 
use  the  one-step  lookahead  policy.  Problems  4  and  5  have 
randomly  generated  data.  The  LP  relaxation  for  problem  5 
was  computed  in  about  30  minutes  on  a  relatively  recent 
desktop  running  CPLEX. 
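The stopping rule used for the simulated trajectories can be made explicit. The following sketch, with hypothetical function names and illustrative values of the discount factor and of the largest absolute reward, computes the first time step at which $\alpha^t$ times the largest absolute immediate reward falls below the tolerance, together with a simple bound on the discounted reward that is ignored by stopping there.

```python
def truncation_horizon(alpha, r_max, tol=1e-6):
    """Smallest t such that alpha**t * r_max < tol."""
    t = 0
    while alpha ** t * r_max >= tol:
        t += 1
    return t


def tail_bound(alpha, r_max, horizon):
    """Upper bound on the discounted reward ignored after `horizon` steps."""
    return alpha ** horizon * r_max / (1.0 - alpha)


# Illustrative values only; they are not taken from the experiments.
alpha, r_max = 0.9, 10.0
T = truncation_horizon(alpha, r_max)
ignored = tail_bound(alpha, r_max, T)
```

The geometric tail shows that the truncation error per trajectory is at most the tolerance divided by $1 - \alpha$, which is small compared with the sampling error of a Monte-Carlo estimate.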

V.  A  “PERFORMANCE”  BOUND 

In  this  section,  we  present  a  result  that  offers  some  insight 
into  why  we  expect  the  one-step  lookahead  policy  to  perform 
well  if  the  linear  programming  relaxation  of  the  original 
problem  is  sufficiently  tight.  Essentially,  we  can  follow  the 
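type of analysis presented in [13]. First, we note that the
vector $\tilde{J}$ in (12) is a feasible vector for the linear program
dual of (4). That is, $\tilde{J}$ is a superharmonic vector for the
original RBSC problem, and we recall that the exact optimal
cost is the smallest superharmonic vector [14]. Writing $\nu$ for
the vector of initial probability distribution, we see then that

\[ \nu^T \tilde{J} \ge \nu^T J^* \tag{14} \]

where $J^*$ is the optimal value function, minimizing the dual
of (4).

Proposition 2: Let $f_\alpha(\nu,\tilde{u})$ be the occupation measure
vector associated with the one-step lookahead policy $\tilde{u}$, and
$J_{\tilde{u}}$ the cost associated to this policy. We have

\[ \nu^T (J^* - J_{\tilde{u}}) \le \frac{1}{1-\alpha} f_\alpha(\nu,\tilde{u})^T (\tilde{J} - J^*). \tag{15} \]

Proof: In the following we denote by $T_u$ and $T$ (with
$TJ = \max_u T_u J$) the dynamic programming operators. The
cost of policy $\tilde{u}$ is given by $J_{\tilde{u}} = T_{\tilde{u}} J_{\tilde{u}}$, i.e.,

\[ J_{\tilde{u}}(x;s) = g((x;s),\tilde{u}(x;s)) + \alpha \sum_{x'} \mathcal{P}_{x \tilde{u}(x;s) x'}\, J_{\tilde{u}}(x';\tilde{u}(x;s)). \]

Let $g_{\tilde{u}}$ be the vector of size $N! \times |\mathcal{S}|$ with components
$g((x;s),\tilde{u}(x;s))$, and $P_{\tilde{u}}$ the stochastic matrix with components
$\mathcal{P}_{x \tilde{u}(x;s) x'}$. We take a compatible ordering of the states
for all vectors and matrices, and we have then

\[ J_{\tilde{u}} = g_{\tilde{u}} + \alpha P_{\tilde{u}} J_{\tilde{u}}. \]

Now $(I - \alpha P_{\tilde{u}})$ has an inverse since $P_{\tilde{u}}$ is a stochastic matrix
and $\alpha < 1$, so we get

\[ J_{\tilde{u}} = (I - \alpha P_{\tilde{u}})^{-1} g_{\tilde{u}}. \]

Under policy $\tilde{u}$, the state of the system evolves as a Markov
chain, and since time is indexed from $t = 1$ we have
$P^{\tilde{u}}_\nu(X_t = x, S_t = s) = [\nu^T P_{\tilde{u}}^{t-1}]_{(x;s)}$.
So the occupation measure (state frequency) is

\[ f_\alpha(\nu,\tilde{u}) = (1-\alpha) \sum_{t=0}^{\infty} \alpha^t \nu^T P_{\tilde{u}}^t = (1-\alpha)\, \nu^T (I - \alpha P_{\tilde{u}})^{-1}. \tag{16} \]

Now

\[ \tilde{J} - J_{\tilde{u}} = (I - \alpha P_{\tilde{u}})^{-1} \left[ (I - \alpha P_{\tilde{u}}) \tilde{J} - g_{\tilde{u}} \right] = (I - \alpha P_{\tilde{u}})^{-1} \left[ \tilde{J} - (g_{\tilde{u}} + \alpha P_{\tilde{u}} \tilde{J}) \right]. \]

By the definition of the lookahead policy, $g_{\tilde{u}} + \alpha P_{\tilde{u}} \tilde{J} = T\tilde{J}$.
So we obtain

\[ \tilde{J} - J_{\tilde{u}} = (I - \alpha P_{\tilde{u}})^{-1} \left[ \tilde{J} - T\tilde{J} \right]. \]

Then starting from (14) and using (16)

\[ \nu^T (J^* - J_{\tilde{u}}) \le \nu^T (\tilde{J} - J_{\tilde{u}}) \le \frac{1}{1-\alpha} f_\alpha(\nu,\tilde{u})^T (\tilde{J} - T\tilde{J}). \]

Now by Bellman's theorem and the fact that $\tilde{J} \ge T\tilde{J}$, we get
$T\tilde{J} \ge T^2\tilde{J} \ge \ldots \ge J^*$. So

\[ \nu^T (J^* - J_{\tilde{u}}) \le \frac{1}{1-\alpha} f_\alpha(\nu,\tilde{u})^T (\tilde{J} - J^*), \]

which is the inequality in the proposition. ■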

In  words,  the  proposition  says  that  starting  with  a  distri¬ 
bution  v  over  the  states,  the  expected  difference  in  reward 
between  the  optimal  solution  and  the  one-step  lookahead 
policy  is  bounded  by  a  weighted  distance  between  the 
estimate $\tilde{J}$ used in the design of the policy and the optimal
value  function  J* .  The  weights  are  given  by  the  occupation 
measure  of  the  one-step  lookahead  policy.  Note  that  this 
is  true  for  every  one-step  lookahead  policy  that  uses  a 


superharmonic  vector  as  an  approximation  of  the  cost-to- 
go.  In  future  research,  we  intend  to  investigate  the  structure 
of  the  relaxation  further  in  order  to  refine  the  upper  bound 
(15). 
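The inequality (15) can be checked numerically on a small generic MDP. The sketch below is an illustration under stated assumptions, not an experiment from the paper: the three-state, two-action MDP is hypothetical, and the superharmonic vector is obtained by applying the dynamic programming operator once to the constant vector $r_{\max}/(1-\alpha)$ rather than from the dual of an LP relaxation. All quantities are computed by fixed-point iteration in plain Python.

```python
def backup(J, P, g, alpha):
    """Apply the DP operator T to J; return (TJ, greedy policy)."""
    n = len(J)
    TJ, policy = [], []
    for x in range(n):
        q = [g[x][a] + alpha * sum(P[x][a][y] * J[y] for y in range(n))
             for a in range(len(g[x]))]
        best = max(range(len(q)), key=q.__getitem__)
        TJ.append(q[best])
        policy.append(best)
    return TJ, policy


def policy_value(policy, P, g, alpha, iters=3000):
    """Iterate J <- g_u + alpha * P_u J to approximate J_u."""
    n = len(policy)
    J = [0.0] * n
    for _ in range(iters):
        J = [g[x][policy[x]]
             + alpha * sum(P[x][policy[x]][y] * J[y] for y in range(n))
             for x in range(n)]
    return J


def state_occupation(policy, P, nu, alpha, iters=3000):
    """(1 - alpha) * sum_t alpha**t * nu^T P_u^t, truncated, as in (16)."""
    n = len(policy)
    f, dist, weight = [0.0] * n, list(nu), 1.0 - alpha
    for _ in range(iters):
        f = [f[x] + weight * dist[x] for x in range(n)]
        dist = [sum(dist[x] * P[x][policy[x]][y] for x in range(n))
                for y in range(n)]
        weight *= alpha
    return f


# Hypothetical 3-state, 2-action MDP (illustrative numbers only).
P = [[[0.8, 0.2, 0.0], [0.1, 0.1, 0.8]],
     [[0.0, 0.9, 0.1], [0.5, 0.0, 0.5]],
     [[0.3, 0.3, 0.4], [0.0, 0.2, 0.8]]]
g = [[1.0, 0.0], [0.5, 2.0], [0.0, 3.0]]
nu = [1.0, 0.0, 0.0]
alpha = 0.9

# J0 = r_max / (1 - alpha) satisfies J0 >= T J0, hence so does J_tilde = T J0.
J0 = [3.0 / (1.0 - alpha)] * 3
J_tilde, _ = backup(J0, P, g, alpha)
_, lookahead = backup(J_tilde, P, g, alpha)   # one-step lookahead policy

J_star = [0.0] * 3
for _ in range(3000):                          # value iteration
    J_star, _ = backup(J_star, P, g, alpha)

J_u = policy_value(lookahead, P, g, alpha)
f = state_occupation(lookahead, P, nu, alpha)
gap = sum(nu[x] * (J_star[x] - J_u[x]) for x in range(3))
bound = sum(f[x] * (J_tilde[x] - J_star[x]) for x in range(3)) / (1.0 - alpha)
```

Here `gap` is the left hand side of (15) and `bound` its right hand side. With a crude superharmonic vector such as this one the bound is typically loose, which is consistent with the remark that the quality of the guarantee depends on how close the approximation is to $J^*$.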

VI.  CONCLUSIONS 

We have presented a linear programming relaxation for
the restless bandits with switching costs problem, which
extends our previous work from the single agent to the multi-agent
case. This framework, motivated by a multi-UAV task
assignment problem, is general enough to model a wide
range of dynamic resource allocation problems of relatively
modest size. An important feature of the method is that it
automatically provides a bound on the performance achievable
by any policy. The techniques rely on the separable structure
of the problem and should be useful for other problems with
similar structure. We designed a one-step lookahead policy
based on this relaxation, which can be implemented in real time,
and should also be implementable in a distributed manner. Future
work will focus on trying to obtain a better characterization
of the performance of our heuristics.

References 

[1] D. Bertsimas and J. Niño-Mora, “Restless bandits, linear programming
relaxations, and a primal-dual index heuristic,” Operations Research,
vol. 48, pp. 80-90, 2000.

[2] J. Gittins, Multi-armed Bandit Allocation Indices, ser. Wiley-Interscience
series in Systems and Optimization. New York: John
Wiley and Sons, 1989.

[3] P. Whittle, “Restless bandits: activity allocation in a changing world,”
Journal of Applied Probability, vol. 25A, pp. 287-298, 1988.

[4] R. Washburn, M. Schneider, and J. Fox, “Stochastic dynamic programming
based approaches to sensor resource management,” in
Proceedings of the International Conference on Information Fusion,
2002.

[5] V. Krishnamurthy and R. Evans, “Hidden Markov model multiarm
bandits: a methodology for beam scheduling in multitarget tracking,”
IEEE Transactions on Signal Processing, vol. 49, no. 12, pp. 2893-2908,
December 2001.

[6] C. Papadimitriou and J. Tsitsiklis, “The complexity of optimal queueing
network control,” Mathematics of Operations Research, vol. 24,
no. 2, pp. 293-305, 1999.

[7] J. Le Ny and E. Feron, “Restless bandits with switching costs: Linear
programming relaxations, performance bounds and limited lookahead
policies,” in American Control Conference, Minneapolis, MN, June
2006, to appear.

[8] E. Altman, Constrained Markov Decision Processes. Chapman and
Hall, 1999.

[9] D. Castanon, “Approximate dynamic programming for sensor management,”
in Proceedings of the 36th Conference on Decision and
Control, December 1997, pp. 1202-1207.

[10] A. Schrijver, Combinatorial Optimization - Polyhedra and Efficiency.
Springer, 2003.

[11] D. Bertsekas and D. Castanon, “Rollout algorithms for stochastic
scheduling problems,” Journal of Heuristics, vol. 5, no. 1, pp. 89-108,
1999.

[12] D. Bertsekas and D. Castanon, “Parallel synchronous and asynchronous
implementations of the auction algorithm,” Parallel Computing,
vol. 17, pp. 707-732, 1991.

[13] D. de Farias and B. Van Roy, “The linear programming approach to
approximate dynamic programming,” Operations Research, vol. 51,
no. 6, pp. 850-865, 2003.

[14] L. Kallenberg, “Survey of linear programming for standard and nonstandard
Markovian control problems. Part I: Theory,” ZOR - Methods
and Models in Operations Research, vol. 40, pp. 1-42, 1994.


